CPython Architecture

What Actually Happens When You Run print("hello")?

A junior engineer writes print("hello") and it works. A senior engineer on-call at 3am faces a Python process consuming 12 CPU cores with zero throughput improvement, a memory leak growing 50MB per hour, and a stack trace that ends somewhere in ceval.c. The difference in their ability to diagnose that incident is not the amount of Python they know - it is whether they have a model of what CPython actually does.

So let us answer the question properly. When you run:

python3 hello.py

Where hello.py contains only print("hello"), here is what actually happens at the C level:

  1. The OS loads the python3 binary, allocates stack and heap, sets up process state
  2. CPython initialises _PyRuntime - the global singleton holding all interpreter state
  3. The source file is opened and passed to the tokeniser (Parser/tokenize.c)
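  4. Tokens feed into the PEG parser (Parser/parser.c), which builds the Abstract Syntax Tree directly; since 3.9 there is no separate concrete syntax tree
  5. The AST is constant-folded (Python/ast_opt.c); the validator in Python/ast.c checks ASTs handed to compile()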
  6. The compiler traverses the AST and emits bytecode (Python/compile.c)
  7. The bytecode is packaged into a PyCodeObject
  8. The interpreter loop (Python/ceval.c) executes the code object
  9. The CALL opcode eventually calls the C function builtin_print in Python/bltinmodule.c
  10. builtin_print calls PyFile_WriteObject, which calls sys.stdout.write(); the io stack (TextIOWrapper → BufferedWriter → FileIO) ends in a write(2) syscall

Ten steps. A C program from start to finish. None of this is magic.
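
Step 10 can be observed without leaving Python. The sketch below is an illustration, not CPython's start-up code: it rebuilds the three io layers that normally sit behind sys.stdout on top of an os.pipe() (the pipe and all variable names are assumptions made for the demo), prints through them, and reads back the bytes that reached the file descriptor.

```python
import io
import os

# A pipe stands in for the terminal so the written bytes can be read back.
read_fd, write_fd = os.pipe()

raw = io.FileIO(write_fd, "w")        # bottom layer: issues the write(2) calls
buffered = io.BufferedWriter(raw)     # middle layer: batches small writes
text = io.TextIOWrapper(buffered, encoding="utf-8", newline="\n")  # str -> bytes

print("hello", file=text)             # print() ends up calling text.write()
text.flush()                          # push buffered bytes down to the descriptor

written = os.read(read_fd, 100)
print(written)                        # b'hello\n'

text.close()                          # closes all three layers and write_fd
os.close(read_fd)
```

Nothing reaches the descriptor until flush() is called, which is why output can appear late when stdout is redirected to a file or pipe.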

The CPython Source Tree

CPython is open source. Clone it and follow along:

git clone https://github.com/python/cpython.git
cd cpython
git checkout v3.13.0

The directories that matter for this module:

cpython/
├── Objects/ ← Every built-in type is defined here
│ ├── object.c ← Base PyObject: Py_INCREF, Py_DECREF, repr, hash
│ ├── longobject.c ← int type - ~5,000 lines of C
│ ├── floatobject.c ← float type
│ ├── listobject.c ← list type: append, insert, sort, resize
│ ├── dictobject.c ← dict type: hash table with open addressing
│ ├── typeobject.c ← type type: metaclass machinery, MRO computation
│ └── funcobject.c ← function objects, closures

├── Python/ ← The interpreter engine
│ ├── ceval.c ← THE eval loop: _PyEval_EvalFrameDefault (~3000 lines)
│ ├── ceval_gil.c ← GIL implementation
│ ├── compile.c ← AST → bytecode compiler
│ ├── ast.c ← AST construction and validation
│ ├── symtable.c ← Symbol table (local/global/free variable analysis)
│ ├── gc.c ← Cyclic garbage collector
│ └── import.c ← import machinery, sys.modules

├── Include/ ← Public C API headers
│ ├── object.h ← PyObject struct definition
│ ├── cpython/ ← CPython-specific (not part of the stable ABI)
│ └── internal/ ← Implementation internals
│ ├── pycore_object.h ← Internal object representation
│ ├── pycore_runtime.h ← _PyRuntimeState (GIL lives here)
│ └── pycore_frame.h ← _PyInterpreterFrame (3.11+)

├── Modules/ ← Standard library C modules
│ ├── _io/ ← io module
│ ├── socketmodule.c
│ └── _csv.c

└── Lib/ ← Pure Python standard library
├── os.py
├── collections/__init__.py
└── pathlib.py

The key insight: Objects/ defines what Python data looks like in memory. Python/ defines how the interpreter processes it. Everything else is built on those foundations.

The Execution Pipeline

From .py file to program output, CPython passes your code through six distinct stages:

┌─────────────────────────────────────────────────────────────────────┐
│ CPython Execution Pipeline │
└─────────────────────────────────────────────────────────────────────┘

hello.py (source text: 'print("hello")')


┌──────────────┐
│ Tokeniser │ Parser/tokenize.c
│ │ Converts text into a stream of tokens
└──────────────┘ NAME('print') OP('(') STRING('"hello"') OP(')')


┌──────────────┐
│ PEG Parser │ Parser/parser.c (replaced LL(1) parser in 3.9)
│ │ Tokens → Abstract Syntax Tree (built directly by the grammar actions)
└──────────────┘ Validates syntax: SyntaxError raised here if malformed


┌──────────────┐
│ AST Passes │ Python/ast.c (validation) + Python/ast_opt.c (optimisation)
│ │ AST → optimised AST
└──────────────┘ Folds constant expressions, e.g. 2 * 3 → 6
│ Augmented assignment a += b stays an AugAssign node; it is not rewritten to a = a + b

┌──────────────────────┐
│ Compiler │ Python/compile.c + Python/symtable.c
│ + Symbol Table │ Analyses scopes, classifies variables
│ │ AST → sequence of bytecode instructions
└──────────────────────┘ Applies peephole/AST optimisations


┌───────────────────┐
│ PyCodeObject │ The compiled result (immutable)
│ │ co_code: raw bytecode bytes
│ │ co_consts: ('hello', None)
└───────────────────┘ co_names: ('print',)
│ co_varnames: () (no locals in module scope)

┌─────────────────────────────────────────────────┐
│ Interpreter Loop (Python/ceval.c) │
│ _PyEval_EvalFrameDefault() │
│ │
│ for each instruction: │
│ opcode = next_instr->opcode │
│ switch(opcode) { case LOAD_NAME: ... } │
└─────────────────────────────────────────────────┘


stdout: hello

You can observe every stage from Python:

import ast
import dis
import io
import tokenize

source = 'print("hello")'

# Stage 1: Tokenisation
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
for tok in tokens:
    # Print the symbolic type name; the numeric token ids change between versions
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'print'
# OP '('
# STRING '"hello"'
# OP ')'
# NEWLINE ''
# ENDMARKER ''

# Stage 3: AST
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
# Module(
#   body=[
#     Expr(
#       value=Call(
#         func=Name(id='print', ctx=Load()),
#         args=[
#           Constant(value='hello')],
#         keywords=[]))],
#   type_ignores=[])
# (Python 3.13+ omits the empty keywords=[] and type_ignores=[] fields)

# Stages 4+5: Compilation to a code object
code = compile(source, '<string>', 'exec')
dis.dis(code)
# On Python 3.12 (opcode names and order differ between versions;
# 3.13 emits PUSH_NULL after LOAD_NAME):
#   RESUME        0
#   PUSH_NULL
#   LOAD_NAME     0 (print)
#   LOAD_CONST    0 ('hello')
#   CALL          1
#   POP_TOP
#   RETURN_CONST  1 (None)

The Evaluation Loop: _PyEval_EvalFrameDefault

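The heart of CPython is a single C function: _PyEval_EvalFrameDefault in Python/ceval.c. It runs to several thousand lines once its instruction bodies are included (since 3.12 they are generated from Python/bytecodes.c into Python/generated_cases.c.h), and some version of this loop has executed every CPython program since the first release in 1991.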

Its structure is a dispatch loop over bytecode instructions:

// Simplified pseudocode of the eval loop
// Real code uses computed gotos and specialised opcode variants
PyObject *
_PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int throwflag)
{
    // The value stack - operands live here
    PyObject **stack_pointer = _PyFrame_GetStackPointer(frame);

    // The bytecode instruction pointer
    _Py_CODEUNIT *next_instr = frame->prev_instr + 1;

    for (;;) {
        _Py_CODEUNIT word = *next_instr++;
        int opcode = _Py_OPCODE(word);
        int oparg = _Py_OPARG(word);

        switch (opcode) {
            case LOAD_FAST: {
                // Read local variable at index oparg
                PyObject *value = GETLOCAL(oparg);
                if (value == NULL) {
                    // UnboundLocalError
                    _PyErr_SetString(tstate, PyExc_UnboundLocalError, ...);
                    goto error;
                }
                Py_INCREF(value);   // Stack now holds a reference
                PUSH(value);
                DISPATCH();         // Back to top of loop
            }

            case LOAD_GLOBAL: {
                // Look up name in globals dict, then builtins
                PyObject *name = GETITEM(frame->f_code->co_names, oparg >> 1);
                PyObject *value;
                if (oparg & 1) {
                    // LOAD_GLOBAL with PUSH_NULL variant (3.11+)
                    PUSH(NULL);
                }
                value = PyDict_GetItemWithError(GLOBALS(), name);
                if (value == NULL) {
                    value = PyDict_GetItemWithError(BUILTINS(), name);
                    if (value == NULL) {
                        // NameError
                        goto error;
                    }
                }
                Py_INCREF(value);
                PUSH(value);
                DISPATCH();
            }

            case BINARY_OP: {
                PyObject *right = POP();
                PyObject *left = TOP();
                // Dispatch to type-specific arithmetic via nb_add etc.
                PyObject *result = binary_ops[oparg](left, right);
                Py_DECREF(left);
                Py_DECREF(right);
                SET_TOP(result);
                if (result == NULL) goto error;
                DISPATCH();
            }

            case CALL: {
                // Full function call - allocate new frame, recurse
                // ... see the full discussion below
                break;
            }

            case RETURN_VALUE: {
                retval = POP();
                // Clean up frame, return to calling frame
                goto return_or_yield;
            }
            // ... ~150 more cases in the real code
        }
    }
}

Two structural points:

The value stack - operands are pushed onto and popped from an array of PyObject* pointers addressed through stack_pointer. This is not the C call stack: it lives in the frame's localsplus array, above the local variables. LOAD_FAST pushes a value; BINARY_OP pops two values and pushes the result. CPython is a stack machine, not a register machine.

Py_INCREF everywhere - every time a value is pushed onto the value stack, its reference count is incremented. Every time it is popped and discarded, it is decremented. Reference counting is not incidental - it is woven into every single opcode.
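
To make the stack-machine idea concrete, here is a toy interpreter in Python. It is a sketch under stated assumptions, not a model of ceval.c: the run function, the (opname, oparg) tuples and the binary_ops table are all hypothetical, and reference counting is left out entirely. The oparg values 0 and 5 mirror the numbers dis prints for + and *.

```python
import operator

def run(program, local_values, consts):
    """Execute (opname, oparg) pairs on a value stack, eval-loop style."""
    stack = []                          # the value stack
    fast_locals = list(local_values)    # stands in for localsplus
    binary_ops = {0: operator.add, 5: operator.mul}

    for opname, oparg in program:
        if opname == "LOAD_FAST":
            stack.append(fast_locals[oparg])
        elif opname == "LOAD_CONST":
            stack.append(consts[oparg])
        elif opname == "STORE_FAST":
            fast_locals[oparg] = stack.pop()
        elif opname == "BINARY_OP":
            right = stack.pop()
            left = stack.pop()
            stack.append(binary_ops[oparg](left, right))
        elif opname == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")

# Hand-written program for: result = a + b; return result * 2
program = [
    ("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BINARY_OP", 0), ("STORE_FAST", 2),
    ("LOAD_FAST", 2), ("LOAD_CONST", 1), ("BINARY_OP", 5), ("RETURN_VALUE", None),
]
result = run(program, local_values=[3, 4, None], consts=(None, 2))
print(result)  # 14
```

The real loop differs mainly in what surrounds each case: reference counting, error paths and specialised fast variants of each instruction.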

PyObject: The Universal Base

Every Python object - integers, strings, lists, functions, classes - is represented in C as a PyObject. Definition from Include/object.h:

// Every Python object starts with this 16-byte header (on 64-bit)
typedef struct _object {
    Py_ssize_t ob_refcnt;    // Reference count (8 bytes)
    PyTypeObject *ob_type;   // Pointer to type (8 bytes)
} PyObject;

That is 16 bytes minimum for every Python object, regardless of what it holds. The ob_refcnt field determines when memory is freed (hits zero → freed). The ob_type pointer is CPython's equivalent of a vtable - it points to a large C struct containing function pointers for every operation the type supports.
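
The "hits zero → freed" rule can be watched from Python. The following is a minimal sketch that assumes CPython's reference-counting behaviour; Node and events are hypothetical names, and weakref.finalize is used only as a convenient way to be notified at deallocation.

```python
import sys
import weakref

class Node:
    """Placeholder class; a bare object() cannot be weakly referenced."""

events = []
obj = Node()
weakref.finalize(obj, events.append, "deallocated")

baseline = sys.getrefcount(obj)   # includes the temporary argument reference
alias = obj                       # one more name: ob_refcnt + 1
with_alias = sys.getrefcount(obj)

del alias                         # ob_refcnt - 1
del obj                           # last reference gone: freed immediately
print(with_alias - baseline, events)
```

No garbage collector pass is involved here: the callback fires as part of the del statement that drops the last reference.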

Concrete types extend this header with their own fields:

// Python int - PyLongObject (3.12+ layout, flattened for readability)
typedef struct {
    PyObject ob_base;        // 16-byte header
    uintptr_t lv_tag;        // Digit count, sign and flags packed together (8 bytes)
    digit ob_digit[1];       // 30-bit digits in 4-byte slots; one slot covers |value| < 2^30
} PyLongObject;
// Total for small int: 28 bytes (header + lv_tag + one digit)

// Python float
typedef struct {
    PyObject ob_base;        // 16-byte header
    double ob_fval;          // The actual double (8 bytes)
} PyFloatObject;
// Total: 24 bytes

// Python list
typedef struct {
    PyObject ob_base;        // 16-byte header
    Py_ssize_t ob_size;      // Current number of items (8 bytes)
    PyObject **ob_item;      // Pointer to array of PyObject* (8 bytes)
    Py_ssize_t allocated;    // Allocated capacity (8 bytes)
} PyListObject;
// Total: 40 bytes + heap allocation for ob_item array
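
The float layout above can be checked by overlaying a ctypes structure on a live object. This is a sketch that assumes a default (GIL-enabled) CPython build, where the header is exactly ob_refcnt followed by ob_type; the FloatMirror name is hypothetical, and free-threaded builds use a different header, so the overlay would not line up there.

```python
import ctypes
import sys

class FloatMirror(ctypes.Structure):
    """Assumed mirror of PyFloatObject on a default CPython build."""
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("ob_fval", ctypes.c_double),
    ]

value = 3.14
# id() returns the object's address in CPython, so the overlay reads live memory.
mirror = FloatMirror.from_address(id(value))

print(mirror.ob_fval)                                       # 3.14
print(mirror.ob_type == id(float))                          # True
print(ctypes.sizeof(FloatMirror) == sys.getsizeof(value))   # True
```

Reading through the overlay is harmless; writing to these fields would corrupt the interpreter, so treat this strictly as an inspection technique.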

You can verify sizes from Python:

import sys

# Values for 64-bit CPython; several changed in 3.12
print(sys.getsizeof(True)) # 28 - PyBoolObject shares the PyLongObject layout
print(sys.getsizeof(0)) # 28 on 3.12+ (24 on 3.11 and earlier)
print(sys.getsizeof(3.14)) # 24 - PyFloatObject
print(sys.getsizeof([])) # 56 - 40-byte PyListObject + 16-byte GC header (no ob_item alloc)
print(sys.getsizeof([1, 2, 3])) # 88 - 56 + 4 * 8: ob_item capacity is rounded up to 4 slots
print(sys.getsizeof({})) # 64 - empty dict
print(sys.getsizeof("hello")) # 46 on 3.12+ (54 on 3.11 and earlier) - header + 5 chars + null

Note that sys.getsizeof returns the size of the object itself, not the total memory reachable from it. list(range(1000)) reports 8056 bytes (56 + 1000*8) for the list and its pointer array, while the integers it points to are separate 28-byte objects (values from -5 to 256 are preallocated and shared).
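
The gap between shallow and reachable size can be measured with a small recursive walk. This is a rough sketch rather than a general-purpose profiler: total_size is a hypothetical helper that only follows the built-in containers listed in the code, and it counts each object once by tracking id().

```python
import sys

def total_size(obj, seen=None):
    """Sum sys.getsizeof over obj and the built-in containers it holds."""
    seen = set() if seen is None else seen
    if id(obj) in seen:          # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

numbers = list(range(1000, 2000))   # values above the small-int cache
shallow = sys.getsizeof(numbers)
deep = total_size(numbers)
print(shallow, deep)
```

For production memory investigations, tracemalloc from the standard library is usually a better starting point than hand-rolled walks like this one.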

PyFrameObject and the Frame Stack

Every function call in Python creates a frame object that holds all the state for that invocation. In Python 3.11+, this is _PyInterpreterFrame (defined in Include/internal/pycore_frame.h):

// _PyInterpreterFrame (3.12 layout, simplified)
// 3.13 renames f_code to f_executable and prev_instr to instr_ptr
struct _PyInterpreterFrame {
    PyCodeObject *f_code;                  // The code object being executed
    struct _PyInterpreterFrame *previous;  // The calling frame (linked list)
    PyObject *f_funcobj;                   // The function object being run
    PyObject *f_globals;                   // globals dict for this frame
    PyObject *f_builtins;                  // builtins dict for this frame
    PyObject *f_locals;                    // locals dict (only materialised on demand)
    PyFrameObject *frame_obj;              // Python-visible frame object, created lazily
    _Py_CODEUNIT *prev_instr;              // Last executed instruction; f_lasti and
                                           // f_lineno are derived from it on request
    int stacktop;                          // Top of the value stack within localsplus
    // The localsplus array follows immediately in memory:
    // [0 .. co_nlocals-1] : local variables
    // [co_nlocals .. co_nlocals+ncellvars-1] : cell variables
    // [... + nfreevars] : free variables
    // [... onwards] : value stack
    PyObject *localsplus[1];               // Variable-length array
};

The localsplus array is the key design decision. In a single contiguous block of memory, it holds local variables, closure cells, free variables, and the value stack. This means:

  • LOAD_FAST 0 is frame->localsplus[0] - a direct C array index, no hash table
  • The value stack pointer (stack_pointer) just moves up and down within the same allocation

A frame for a function with 5 locals and max stack depth 4 needs (5 + 4) * 8 = 72 bytes of localsplus, plus the fixed fields.

import sys

def count_args(a, b, c):
    x = a + b
    y = x + c
    return y

code = count_args.__code__
print(f"co_varnames: {code.co_varnames}") # ('a', 'b', 'c', 'x', 'y')
print(f"co_nlocals: {code.co_nlocals}") # 5
print(f"co_stacksize: {code.co_stacksize}") # 2 (max 2 items on stack at once)

# Watch frames during execution
def show_frame_chain():
    frame = sys._getframe()
    depth = 0
    while frame is not None:
        print(f" {' ' * depth}{frame.f_code.co_name} "
              f"(line {frame.f_lineno}, {frame.f_code.co_filename})")
        frame = frame.f_back   # Walk towards the caller
        depth += 1

def outer():
    def inner():
        show_frame_chain()
    inner()

outer()
# show_frame_chain (line X, <stdin>)
#  inner (line X, <stdin>)
#   outer (line X, <stdin>)
#    <module> (line X, <stdin>)
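
The localsplus arithmetic above can be applied to any function. The helper below is a hypothetical estimate, not something CPython exposes: it counts only locals and value-stack slots and ignores cell and free variables, so treat the result as a lower bound for functions that use closures.

```python
import struct

POINTER_SIZE = struct.calcsize("P")   # 8 on 64-bit builds

def estimate_localsplus_bytes(func):
    """Estimate localsplus size as (locals + max stack depth) * pointer size."""
    code = func.__code__
    return (code.co_nlocals + code.co_stacksize) * POINTER_SIZE

def sample(a, b, c):
    x = a + b
    y = x + c
    return y

estimate = estimate_localsplus_bytes(sample)
print(sample.__code__.co_nlocals, sample.__code__.co_stacksize, estimate)
```

Because co_stacksize is computed by the compiler, the estimate can shift slightly between Python versions for the same source.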

Reading dis.dis() Output

The dis module translates bytecode back into human-readable form. It is the single most useful tool for understanding CPython behaviour:

import dis

def add_and_double(a, b):
    result = a + b
    return result * 2

dis.dis(add_and_double)

Output on Python 3.12:

  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 STORE_FAST               2 (result)

  4          12 LOAD_FAST                2 (result)
             14 LOAD_CONST               1 (2)
             16 BINARY_OP                5 (*)
             20 RETURN_VALUE

Column guide:

  • Col 1: Source line number (only shown for the first instruction per line)
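  • Col 2: Bytecode offset in bytes (each instruction is 2 bytes wide in Python 3.6+; jumps such as 6 → 10 are hidden inline cache entries that follow specialisable instructions in 3.11+)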
  • Col 3: Opcode name
  • Col 4: Opcode argument (integer index into co_varnames, co_consts, etc.)
  • Col 5: Human-readable argument value (name or constant)

Mapping each opcode to its ceval.c action:

Opcode        Argument    C-level action
RESUME        0           Entry-point marker; checks eval breaker (GIL, signals)
LOAD_FAST     0 (a)       value = localsplus[0]; Py_INCREF(value); PUSH(value)
LOAD_FAST     1 (b)       value = localsplus[1]; Py_INCREF(value); PUSH(value)
BINARY_OP     0 (+)       right=POP(); left=POP(); res=PyNumber_Add(left,right); PUSH(res)
STORE_FAST    2 (result)  value=POP(); old=localsplus[2]; localsplus[2]=value; Py_XDECREF(old)
LOAD_CONST    1 (2)       value = co_consts[1]; Py_INCREF(value); PUSH(value)
BINARY_OP     5 (*)       right=POP(); left=POP(); res=PyNumber_Multiply(left,right); PUSH(res)
RETURN_VALUE  -           retval=POP(); clean up frame; return retval
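
The offset column can also be inspected programmatically. The sketch below uses dis.get_instructions and infers hidden cache entries from the gap between consecutive offsets; that inference is an assumption that holds for simple functions like this one, and the exact opcode names printed depend on the Python version.

```python
import dis

def add_and_double(a, b):
    result = a + b
    return result * 2

instructions = list(dis.get_instructions(add_and_double))
offsets = [instr.offset for instr in instructions]

for current, following in zip(instructions, instructions[1:]):
    gap = following.offset - current.offset
    # Each code unit is 2 bytes; anything beyond the first unit is cache space.
    hidden_units = gap // 2 - 1
    print(f"{current.offset:3d} {current.opname:<28s} +{gap} bytes, {hidden_units} cache units")
```

On 3.11 and later, dis.dis(add_and_double, show_caches=True) prints the cache entries explicitly.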

For more detail, use dis.Bytecode which gives you structured access:

import dis

def classify(n):
    if n > 0:
        return "positive"
    elif n < 0:
        return "negative"
    return "zero"

# Get structured bytecode objects
bc = dis.Bytecode(classify)
for instr in bc:
    print(f" {instr.offset:3d} {instr.opname:<20s} {instr.argval!r}")

Accessing CPython Internals from Python

You do not need to compile CPython to inspect its internals. Python ships several modules that expose the C layer:

import ctypes
import gc
import inspect
import sys

# --- sys module ---
print(sys.getsizeof([])) # 56 - bytes of a PyListObject (including GC header)
print(sys.getrefcount(None)) # Very large - None is referenced everywhere (immortal in 3.12+)
print(sys.getswitchinterval()) # 0.005 - GIL switch every 5ms
version = sys.version_info
print(f"Python {version.major}.{version.minor}.{version.micro}")

# --- gc module ---
print(gc.get_threshold()) # (700, 10, 10) on 3.12 and earlier; defaults may differ in newer releases
print(gc.get_count()) # Objects in each generation right now
print(gc.isenabled()) # True by default
gc.collect() # Force a collection cycle

# --- inspect module ---
def greet(name: str, times: int = 1) -> str:
    return f"Hello, {name}!" * times

sig = inspect.signature(greet)
print(sig) # (name: str, times: int = 1) -> str

code = greet.__code__
print(f"co_varnames: {code.co_varnames}") # ('name', 'times')
print(f"co_consts: {code.co_consts}") # Contains 'Hello, ' and '!'
print(f"__defaults__: {greet.__defaults__}") # (1,) - defaults live on the function, not in co_consts
print(f"co_argcount: {code.co_argcount}") # 2
print(f"co_stacksize: {code.co_stacksize}") # Max stack depth

# --- ctypes: reading ob_refcnt directly from C memory ---
# Assumes a default (GIL-enabled) 64-bit build, where ob_refcnt is the first field
x = [1, 2, 3]
refcnt = ctypes.c_ssize_t.from_address(id(x)).value
print(f"ob_refcnt via ctypes: {refcnt}") # 1 (only the name x)
print(f"sys.getrefcount(x): {sys.getrefcount(x)}") # 2 (x + the call's argument)

# ob_type is the next 8 bytes - it is the address of the type object
ob_type_ptr = ctypes.c_void_p.from_address(id(x) + 8).value
print(f"ob_type address: {hex(ob_type_ptr)}")
print(f"id(list): {hex(id(list))}") # Should match

Tracing a Complete Function Execution

Putting it all together: a complete trace of one function call through every layer:

def multiply(a, b):
    return a * b

result = multiply(3, 7)

At the C level, when multiply(3, 7) is called:

Caller frame (module level):
  Executing: CALL 2

1. CALL finds its operands on the caller's value stack:
   - function object: <function multiply at 0x...>
   - arg 0: PyLongObject for 3
   - arg 1: PyLongObject for 7
   The stack already owns one reference to each; CALL hands those references to the callee

2. The callable is dispatched:
   - A plain Python function takes the inlined path (3.11+): CALL builds the callee frame itself
   - Any other callable goes through _PyObject_Vectorcall(func, args, 2, NULL),
     which uses the type's vectorcall slot when it is set

3. A new _PyInterpreterFrame is allocated:
   - In the thread's data stack, a contiguous chunk of memory separate from the C stack (3.11+)
   - A new chunk is taken from the heap when the current one is full
   - localsplus[0] = 3 (reference moved from the caller's stack)
   - localsplus[1] = 7 (reference moved from the caller's stack)
   - frame->f_code = multiply.__code__
   - frame->previous = caller_frame

4. The same _PyEval_EvalFrameDefault invocation switches to the new frame
   (Python-to-Python calls no longer recurse at the C level in 3.11+):
   LOAD_FAST 0 (a) → Py_INCREF(3), PUSH(3)
   LOAD_FAST 1 (b) → Py_INCREF(7), PUSH(7)
   BINARY_OP 5 (*) → POP() = 7, POP() = 3
                     long_mul(3, 7) in longobject.c
                     returns the preallocated small-int object for 21
                     (values outside -5..256 allocate a new PyLongObject)
                     PUSH(21)
                     Py_DECREF(3), Py_DECREF(7)
   RETURN_VALUE    → retval = POP() = 21
                     frame deallocation begins
                     Py_DECREF(localsplus[0] = 3)
                     Py_DECREF(localsplus[1] = 7)

5. Back in caller frame: STORE_NAME binds result to PyLong(21) in the module namespace
   (STORE_FAST would be used if the call happened inside a function)

Every function call in Python goes through this path. This overhead - frame setup, reference counting, opcode dispatch - is a large part of why pure-Python code is often one to two orders of magnitude slower than C in tight numeric loops.
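
The caller's side of this trace is visible in the module's own bytecode. A short sketch, assuming only the standard dis module (SOURCE and the other names are placeholders): it compiles the two-statement example, lists the module-level opcodes, and runs the code to confirm the result.

```python
import dis

SOURCE = """
def multiply(a, b):
    return a * b

result = multiply(3, 7)
"""

module_code = compile(SOURCE, "<trace>", "exec")

# get_instructions does not descend into nested code objects,
# so this lists only the module-level (caller) instructions.
module_ops = [instr.opname for instr in dis.get_instructions(module_code)]
print(module_ops)

namespace = {}
exec(module_code, namespace)
print(namespace["result"])  # 21
```

At module level the result is bound with STORE_NAME; the same call inside a function body would end in STORE_FAST instead.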

Interview Q&A

Q1: What is the difference between CPython, PyPy, and Jython? When would you choose each?

CPython is the reference implementation - it is the python3 binary you get when installing Python. It compiles Python source to bytecode and interprets that bytecode in a C evaluation loop. It has the most complete library support, the most predictable performance characteristics, and is what you should use by default in production.

PyPy is a Python interpreter written in RPython (a restricted, statically-typeable subset of Python). Its key feature is a tracing JIT compiler that observes hot code paths and compiles them to native machine code. For CPU-bound pure-Python workloads, PyPy is often several times faster than CPython, though the gain depends heavily on the workload. The tradeoffs: startup latency before JIT warmup, higher peak memory usage, and compatibility issues with C extensions that depend on CPython ABI internals (cffi-based extensions work well; Cython-based ones may not).

Jython compiles Python to JVM bytecode. It has no GIL (JVM handles threading), enabling true thread-based parallelism. The tradeoffs: effectively frozen at Python 2 compatibility, limited C extension support, not suitable for new projects.

Choose CPython by default for production systems. Consider PyPy only for long-running CPU-bound services written in pure Python. Avoid Jython in new work.

Q2: Why are Python function calls significantly slower than C function calls?

A Python function call involves: (1) opcode dispatch for CALL in the eval loop, (2) looking up the function object - either from LOAD_FAST (local) or LOAD_GLOBAL (dict lookup), (3) type-checking the callable object, (4) dispatching through the vectorcall protocol or the inlined Python-function path, (5) allocating a _PyInterpreterFrame (a fixed header plus one pointer-sized slot per local and stack entry), (6) moving arguments into localsplus, (7) applying default values and keyword argument normalisation, (8) running the callee's bytecode in the eval loop, and (9) the entire process in reverse on return - plus Py_INCREF/Py_DECREF at nearly every step.

A C function call is a single CALL instruction moving the instruction pointer, one stack frame push, and function body execution. The overhead ratio is roughly 50-100x for a do-nothing function. This is why tight inner loops should avoid per-element Python function calls: inline the work, lean on builtins implemented in C, or move the loop into NumPy operations or compiled code.

Q3: What is a PyCodeObject and what does it contain?

A PyCodeObject is the compiled representation of a Python function, class body, or module - produced by the compiler, stored in .pyc files, and shared across all calls to the function (unlike frames, which are per-call). Key fields:

  • co_code: The raw bytecode as a bytes object (each instruction is 2 bytes: opcode + arg)
  • co_consts: Tuple of constants referenced by LOAD_CONST (literals, nested code objects)
  • co_names: Tuple of global and attribute names referenced by LOAD_GLOBAL, LOAD_ATTR
  • co_varnames: Tuple of local variable names in position order (index 0 = first argument)
  • co_freevars: Variables captured from enclosing scopes (for closures)
  • co_cellvars: Variables captured by nested closures defined inside this function
  • co_argcount: Number of positional arguments
  • co_stacksize: Maximum value stack depth needed - used to size the frame
  • co_filename, co_firstlineno: For tracebacks and inspect.getsource()

The code object is immutable. The frame is created fresh for each call and holds the mutable runtime state (current instruction pointer, local variable values, value stack).
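
Both properties, serialisable and read-only, can be checked directly. The sketch below assumes only the standard marshal and types modules (marshal is the format used inside .pyc files); area and the other names are placeholders.

```python
import marshal
import types

def area(width, height):
    return width * height

code = area.__code__

# Round-trip the code object through the serialisation used by .pyc files.
payload = marshal.dumps(code)
restored = marshal.loads(payload)

# A function is just a code object paired with a globals dict.
rebuilt = types.FunctionType(restored, {}, "area_copy")
print(rebuilt(3, 4))  # 12

# Code object attributes are read-only from Python.
read_only = False
try:
    code.co_consts = ()
except AttributeError:
    read_only = True
print(read_only)  # True
```

To derive a modified code object, code.replace(...) (3.8+) returns a new one and leaves the original untouched.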

Q4: How does Python resolve variable names? Explain LOAD_FAST, LOAD_GLOBAL, and LOAD_DEREF.

At compile time, symtable.c analyses each function and classifies every name as LOCAL, GLOBAL, FREE, or CELL. This classification determines which opcode is emitted - the resolution strategy is fixed at compile time, not runtime.

LOAD_FAST: For local variables (those assigned within the function). At runtime: value = localsplus[i] - a direct C array index. Zero hash table lookups. This is the fastest possible variable access in Python.

LOAD_GLOBAL: For names not assigned locally - globals and builtins. At runtime: calls PyDict_GetItemWithError(globals, name), then falls back to builtins. This is a hash table lookup. In Python 3.11+, LOAD_GLOBAL has specialised variants (LOAD_GLOBAL_MODULE, LOAD_GLOBAL_BUILTIN) that cache the dict's keys version and the entry's index, avoiding the hash lookup on warm hits.

LOAD_DEREF: For free variables - variables from enclosing scopes captured by a closure. At runtime: reads from a PyCellObject, which is an extra indirection. Cell objects allow the enclosing scope to update the variable after the closure is created: cell->ob_ref is the actual value.
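
The compile-time classification can be inspected with the standard symtable module and cross-checked against the emitted opcodes. This is a sketch with placeholder names (SOURCE, outer, inner, captured); the exact LOAD_FAST variant emitted varies by Python version, so the code only looks at opcode name prefixes.

```python
import dis
import symtable

SOURCE = '''
def outer():
    captured = 10
    def inner(arg):
        local = arg + captured
        return len(str(local))
    return inner
'''

# Walk module -> outer -> inner in the symbol table tree.
top = symtable.symtable(SOURCE, "<example>", "exec")
inner_table = top.get_children()[0].get_children()[0]

classification = {}
for name in ("local", "captured", "len"):
    sym = inner_table.lookup(name)
    if sym.is_local():
        classification[name] = "local"
    elif sym.is_free():
        classification[name] = "free"
    else:
        classification[name] = "global"
print(classification)

namespace = {}
exec(SOURCE, namespace)
inner = namespace["outer"]()
opnames = {instr.opname for instr in dis.get_instructions(inner)}
print(sorted(op for op in opnames if op.startswith("LOAD_")))
```

Each classification lines up with an opcode family: local with LOAD_FAST, free with LOAD_DEREF, and global with LOAD_GLOBAL.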

Q5: What happens at the C level when you import json the first time vs. the second time?

First import: __import__('json') triggers the import machinery (Python/import.c plus the frozen importlib bootstrap). The machinery: (1) checks sys.modules['json'] - not present, (2) searches sys.path for json (finds Lib/json/__init__.py), (3) loads cached bytecode from __pycache__ if a valid .pyc exists; otherwise opens and reads the source file, (4) when no usable .pyc was found, calls compile() to produce a PyCodeObject and writes a new .pyc, (5) allocates a fresh PyModuleObject with an empty __dict__, (6) sets sys.modules['json'] = module (before execution, to handle circular imports), (7) executes the module's bytecode in the module's __dict__ namespace - this populates json.loads, json.dumps, json.JSONDecodeError, etc., (8) returns the module. This involves file I/O, possibly compilation, and executing hundreds of lines of Python initialisation code.

Second import: __import__('json') checks sys.modules first. This is a single dict hash lookup taking nanoseconds. The cached module object is returned immediately. No file I/O, no compilation, no module body execution. This is why top-level import statements in modules are effectively free after the first call and why you should never worry about importing in a module's global scope.
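
The cache hit is easy to confirm: every later import returns the very same module object. A minimal sketch using only the standard library (the alias names are placeholders):

```python
import importlib
import sys

import json                      # runs the module body if nothing imported it earlier
cached = sys.modules["json"]

import json as json_again        # found in sys.modules, body not re-executed
via_importlib = importlib.import_module("json")

print(json_again is cached, via_importlib is cached)  # True True
print(json.__spec__.origin)      # path of the file the module was loaded from
```

Because the module object is shared, state set on it in one place is visible to every other importer in the process.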

© 2026 EngineersOfAI. All rights reserved.